Towards the adaptation of prosodic models for expressive text-to-speech synthesis
Abstract
This paper presents a preliminary study that aims to characterize four distinct speaking styles using a limited set of prosodic features, including the length of prosodic phrases (AP and IP), the distribution of stressed syllables, pitch register span, and the duration of silent pauses. The analysis was performed using semi-automatic procedures on a corpus of 30 minutes of speech per style. All four styles are “overtly addressed to a given audience”, but they differ in the nature of the audience (adults vs. children) and in the desired impact of the address (“importance of being understood and convincing, or not”). Data analysis reveals that (a) dictation (addressed to children) and political speeches (addressed to adults) differ from the two other speaking styles (reading of novels and fairy tales) with respect to a specific set of prosodic cues, while (b) the speech addressed to children differs from that addressed to adults with respect to another set of prosodic cues (especially pitch register span). These results have a practical application: refining the design of prosodic pre-processing modules in a text-to-speech system in order to improve the expressivity of synthesized speech.
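Two of the features named in the abstract, pitch register span and silent pause duration, can be illustrated with a minimal sketch. The abstract does not say how the authors measured them, so everything below is an assumption made for illustration: the function names, the use of F0 percentiles expressed in semitones, and the 200 ms pause threshold are hypothetical choices, not the paper's procedure.

```python
import math


def pitch_span_semitones(f0_hz, low_pct=0.05, high_pct=0.95):
    """Pitch register span in semitones between two F0 percentiles.

    f0_hz: F0 values in Hz; values <= 0 are treated as unvoiced and ignored.
    The percentile bounds are illustrative assumptions, not from the paper.
    """
    voiced = sorted(f for f in f0_hz if f > 0)
    if len(voiced) < 2:
        raise ValueError("need at least two voiced F0 values")
    low = voiced[int(low_pct * (len(voiced) - 1))]
    high = voiced[int(high_pct * (len(voiced) - 1))]
    # 12 semitones per octave; one octave is a doubling of frequency.
    return 12 * math.log2(high / low)


def silent_pauses(speech_intervals, min_pause=0.2):
    """Durations (in seconds) of gaps between consecutive speech intervals.

    speech_intervals: list of (start, end) times in seconds, sorted by start.
    min_pause: shortest gap counted as a silent pause (assumed threshold).
    """
    pauses = []
    for (_, prev_end), (next_start, _) in zip(speech_intervals, speech_intervals[1:]):
        gap = next_start - prev_end
        if gap >= min_pause:
            pauses.append(gap)
    return pauses
```

Computed per style, such values could be compared across the four corpora; the semitone scale is used here because it makes spans comparable across speakers with different average pitch.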
Related articles
Hierarchical stress modeling and generation in Mandarin for expressive text-to-speech
Expressive speech synthesis has received increased attention in recent times. Stress (or pitch accent) is the perceptual prominence within words or utterances, which contributes to the expressivity of speech. This paper summarizes our contribution to Mandarin expressive speech synthesis. A novel hierarchical stress modeling and generation method for Mandarin is proposed and further integrated i...
HMM-based expressive speech synthesis: towards TTS with arbitrary speaking styles and emotions
This paper describes recent progress in our approach to generating expressive speech. A goal of text-to-speech (TTS) synthesis is the ability to generate natural-sounding speech with arbitrary speaker voice characteristics, speaking styles and emotional expressions. To change the voice, speaking style and/or emotion of the synthetic speech arbitrarily while maintaining its naturalness, i...
Modeling the acoustic correlates of expressive elements in text genres for expressive text-to-speech synthesis
This paper proposes a novel approach for describing the expressive elements in text genres and modeling their acoustic correlates for expressive text-to-speech synthesis (TTS). We apply the three-dimensional PAD (pleasure-displeasure, arousal-nonarousal and dominance-submissiveness) model in describing expressivity. In particular, we define a set of principles for annotating the P and A values ...
Two-stage prosody prediction for emotional text-to-speech synthesis
In this paper, we adopt a difference approach to prosody prediction for emotional text-to-speech synthesis, where the prosodic variations between emotional and neutral speech are decomposed into the global and local prosodic variations and predicted using a two-stage model. The global prosodic variations are modeled by the means and standard deviations of the prosodic parameters, while the loca...
Design and evaluation of a speech synthesis model based on concatenation of prosodic-context-sensitive units
This paper describes the design and evaluation of prosodically-sensitive concatenative units for a Persian text-to-speech (TTS) synthesis system. The syllables used are prosodically conditioned in the sense that a single conventional syllable is stored as different versions taken directly from the different prosodic domains of the prosodically labeled, read sentences. The three levels of the Per...